3.1 - Goal - Learning a Single Economic Task

Our first foray into reinforcement learning will be a foundational "Hello, World!" task. The objective is to create the simplest possible environment where an agent can learn a single, vital economic behavior: producing worker units.

By isolating this one task, we can verify that our entire RL architecture—from the SC2GymEnv wrapper to the training script—is functioning correctly before adding further complexity.

Problem Definition

We will train an agent whose sole purpose is to learn the correct policy for building SCVs from a Command Center.

  • Success Criteria:
    • The agent must learn to check for sufficient resources (minerals >= 50).
    • The agent must learn to check for an available production facility (idle Command Center).
    • The agent must learn to correlate the correct game state with the "build worker" action.
    • The episode must terminate successfully upon reaching a target of 20 workers.
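Below is a minimal sketch of these checks, assuming python-sc2-style bot attributes (self.minerals, self.workers, self.townhalls); the helper names and the idle-townhall check are illustrative, not the final implementation.

```python
# Illustrative sketch only, assuming python-sc2-style bot attributes
# (self.minerals, self.workers, self.townhalls); helper names are hypothetical.
TARGET_WORKERS = 20
SCV_MINERAL_COST = 50

def can_build_scv(bot) -> bool:
    # Sufficient resources AND an available (idle) production facility.
    return bot.minerals >= SCV_MINERAL_COST and bot.townhalls.idle.exists

def episode_complete(bot) -> bool:
    # The episode terminates successfully once the worker target is reached.
    return bot.workers.amount >= TARGET_WORKERS
```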

Environment Design

To facilitate this learning, we will design a minimal environment with precisely defined observation, action, and reward structures.

1. Observation Space (The Agent's Input)

The state provided to the agent must contain only the information necessary to make the "build worker" decision. All values are normalized to a [0.0, 1.0] range to improve neural network stability.

| Index | Feature | Normalization Factor | Purpose |
| --- | --- | --- | --- |
| 0 | self.minerals | 1000 | "Can I afford the action?" |
| 1 | self.workers.amount | 50 | "How close am I to the goal?" |
| 2 | self.supply_left | 20 | "Do I have the capacity for a new unit?" |
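A sketch of how this observation vector might be assembled with Gymnasium, dividing each raw value by its normalization factor from the table; the build_observation helper is hypothetical.

```python
import numpy as np
from gymnasium import spaces

# A 3-element vector, every feature scaled into [0.0, 1.0].
observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)

def build_observation(bot) -> np.ndarray:
    # Divide each raw value by its normalization factor, then clip to [0, 1].
    obs = np.array(
        [
            bot.minerals / 1000.0,       # "Can I afford the action?"
            bot.workers.amount / 50.0,   # "How close am I to the goal?"
            bot.supply_left / 20.0,      # "Do I have capacity for a new unit?"
        ],
        dtype=np.float32,
    )
    return np.clip(obs, 0.0, 1.0)
```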

2. Action Space (The Agent's Output)

The agent has a discrete, binary choice on each decision step.

| Action Value | Agent's Intent |
| --- | --- |
| 0 | Do Nothing |
| 1 | Attempt to build an SCV |
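One possible way to translate that discrete choice into game commands, assuming python-sc2's UnitTypeId and Unit.train (older library versions may require the command to be wrapped in `await bot.do(...)`); the apply_action helper and its return values are assumptions used to feed the reward function below.

```python
from gymnasium import spaces
from sc2.ids.unit_typeid import UnitTypeId

action_space = spaces.Discrete(2)  # 0 = do nothing, 1 = attempt to build an SCV

def apply_action(bot, action: int) -> tuple[bool, bool]:
    """Return (build_attempted, build_was_possible) for the reward function."""
    if action == 0:
        return False, False  # Do nothing this decision step.
    if bot.minerals >= 50 and bot.townhalls.idle.exists and bot.supply_left > 0:
        # Queue an SCV at an idle Command Center.
        bot.townhalls.idle.first.train(UnitTypeId.SCV)
        return True, True
    return True, False  # Build was requested but is impossible right now.
```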

3. Reward Function (The Learning Signal)

The reward function is engineered to provide clear and immediate feedback, guiding the agent toward the correct behavior.

| Triggering Condition | Reward Value | Signal Type | Purpose |
| --- | --- | --- | --- |
| "Build" action is taken and is possible | +5 | Sparse | Strongly reinforce the correct decision. |
| "Build" action is taken but is impossible | -5 | Sparse | Strongly penalize impossible actions (teaches game rules). |
| Every decision step | + (worker_count * 0.1) | Dense | Continuously encourage progress toward the goal. |
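Put together, the reward could be computed roughly as sketched below; build_attempted and build_was_possible are the hypothetical flags returned by the action handler above.

```python
def compute_reward(worker_count: int, build_attempted: bool, build_was_possible: bool) -> float:
    reward = 0.0
    if build_attempted:
        # Sparse signal: reinforce legal builds, penalize impossible ones.
        reward += 5.0 if build_was_possible else -5.0
    # Dense signal: continuous encouragement proportional to progress.
    reward += worker_count * 0.1
    return reward
```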

Key Learning Challenges for the Agent

  • Credit Assignment: The agent must learn that the +5 reward is directly caused by its "Build" action when resources are sufficient.
  • Constraint Learning: The agent must learn to avoid the -5 penalty by paying attention to the minerals and supply_left observations, effectively learning the game's constraints.
  • Temporal Progression: The dense reward tied to worker_count encourages the agent not merely to "do nothing" but to actively pursue the goal over time.

This tightly scoped problem provides a perfect, verifiable test case for our RL framework. In the next section, we will translate this design into code.